Change-point analysis is a flexible and computationally tractable tool for the analysis of time-series
data from systems that transition between discrete states and whose observables are corrupted
by noise. The change-point algorithm is used to identify the time indices (change points) at which the
system transitions between these discrete states. We present a unified information-based approach
to testing for the existence of change points. This new approach reconciles two previously disparate
approaches to Change-Point Analysis (frequentist and information-based) for testing transitions between
states. The resulting method is statistically principled, parameter and prior free, and applicable
to a wide range of change-point problems.


I. INTRODUCTION 

The problem of determining the true state of a system that transitions between discrete states and whose observables
are corrupted by noise is a canonical problem in statistics with a long history (e.g. [1]). The approach we
discuss in this paper is called Change-Point Analysis and was first proposed by E. S. Page in the mid 1950s [2, 3]. 
Since its inception, Change-Point Analysis has been used in a great number of contexts and is regularly re-invented 
in fields ranging from geology to biophysics [1, 4, 5]. 

The primary goal of this paper is to develop a new information-based approach to Change-Point Analysis which
simplifies its application in problems, including those where specific change-point statistics have not been computed.
We approach Change-Point Analysis from the perspective of Model Selection and Information Theory. Akaike
pioneered a powerful approach to Model Selection by the minimization of the Kullback-Leibler Divergence [6], a 
measure of information loss by approximating the true process with a model [7, 8]. He demonstrated that two key 
principles of modeling, predictivity and parsimony, were in fact conceptually and mathematically linked (e.g. [8]). In 
short, the addition of superfluous parameters to a model, reducing parsimony, results in information loss, reducing 
predictivity (e.g. [8]). Akaike derived an unbiased estimator for information loss, the Akaike Information Criterion 
(AIC), which proved to be at once exceptionally tractable and widely applicable. 

Unfortunately, Akaike's approach is limited to regular models [9]. Change-Point Analysis and many other applications
are singular. These models contain unidentifiable parameters with nearly zero Fisher Information, which
greatly increase the complexity of the model and lead to the catastrophic failure of AIC to estimate information loss.
The subject of this paper is the implementation of information-based model selection in the context of Change-Point
Analysis. We have recently proposed a Frequentist Information Criterion (FIC) applicable even in the context of
singular models. Using FIC and an approximation analogous to that used by Akaike to derive AIC, we develop a
model criterion that accounts for the unidentifiability of the change-point indices. Importantly, this criterion does
not depend on the detailed form of the model for the individual states but only on the number of model parameters,
in close analogy with AIC. Therefore we expect this result to be widely applicable anywhere the change-point
algorithm is applied.

Frequentist statistical tests have already been defined for a number of canonical change-point problems. It is therefore
interesting to examine the relation between this approach and our newly derived information-based approach.
We find the approaches are fundamentally related. The information-based approach can be understood to provide
a predictively optimal confidence level for a generalized ratio test. The Bayesian Information Criterion (BIC) has
also been used in the context of Change-Point Analysis. We find very significant differences between our results and
the BIC complexity that suggest that BIC is not suitable for application to change-point analysis since it can lead to
either over- or under-segmentation of the data, depending on the specific context.
FIG. 1: Panel A: State model schematic. A state model for biophysical applications is parameterized by four model parameters
that are written as the vector $\theta = (k, \epsilon, \mu, \sigma)$. Above we schematically illustrate the role of each parameter in shaping the signal.
Panel B: Schematic of binary segmentation. To segment a partition, the information change due to placing a change point at
each time index along the x axis is considered. (The dashed red and green lines represent possible change points.) For each
change-point index, a minimum-information fit is performed on the two resulting data partitions (top panel, blue dots), resulting
in the solid curves (top panel, red and green for the respective change-point indices). For each change-point index, an information
change is computed (bottom panel). The change point is placed at the time index that minimizes the information change (red
dashed line).


II. PRELIMINARIES 


We introduce the following notation for a signal: a set of N ordered observations from a one-dimensional stochastic
process$^1$:

$$X^N \equiv (X_1, X_2, \ldots, X_N), \tag{1}$$

where the observation index is often but not exclusively temporal and the probability distribution for the stochastic
process is represented as p. We shall represent the probability distribution for the model M as:

$$q(X^N|M), \tag{2}$$

where there is no guarantee that the true distribution is a member of the model family.

Information and cross entropy. The coding information for signal $X^N$ given model M is:

$$h(X^N|M) \equiv -\log q(X^N|M), \tag{3}$$

and the cross entropy for the signal (average coding information) is:

$$H^N(M) \equiv \mathop{E}_{X \sim p(\cdot)} h(X^N|M), \tag{4}$$

where the expectation over the signal $X^N$ is understood to be taken over the true distribution p.

The Change-Point Model. We define a model for the signal corresponding to a system transitioning between a
set of discrete states. We define the discrete time index corresponding to the start of the Ith state $i_I$. This index is
called a change point. The model parameters describing the signal in the Ith interval are $\theta_I$. Together these two sets
of parameters ($i_I$ and $\theta_I$) parameterize the model $M^n$. The model parameterization for the signal (including multiple
states) can then be written explicitly:

$$M^n \equiv \begin{pmatrix} i_1 & i_2 & \cdots & i_n \\ \theta_1 & \theta_2 & \cdots & \theta_n \end{pmatrix}, \tag{5}$$


$^1$ When X appears in upper case, it should be understood as a random variable whereas it is a normal variable when it appears in lower case. If
we need a statistically independent set of variables of equal size, we will use the random variables $Y^N$, which have identical properties to the
$X^N$.
where n is the number of states or change points. A schematic example of a change-point model for a biophysical
signal is shown in Figure 1. The two sets of parameters ($\theta_I$ and $i_I$) are fundamentally different. We shall assume
that the state model is regular: i.e. the parameters $\theta_I$ have non-zero Fisher information [10]. By contrast, the change-point
indices $i_I$ are discrete and typically non-harmonic parameters. For instance, consider a true model p = q where
$\theta_1 = \theta_2$. In this scenario the cross entropy will be independent of $i_2$ as long as $i_2 \in (i_1, i_3)$. The Fisher information
corresponding to $i_2$ is therefore zero. These properties have important consequences for model selection [10].
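The unidentifiability of a change-point index between two states with equal parameters can be illustrated numerically. The following minimal sketch assumes a unit-variance Gaussian state model whose only parameter is the mean; the function name `two_state_information` and the data values are hypothetical and are not taken from the paper. When the two means coincide, the information of a fixed signal does not depend on where the change point is placed, whereas for distinct means it does.

```python
def two_state_information(signal, split, mean_1, mean_2):
    """Information (negative log-likelihood up to a constant) of a signal under a
    two-state, unit-variance Gaussian model with a change point at index `split`."""
    left = sum(0.5 * (x - mean_1) ** 2 for x in signal[:split])
    right = sum(0.5 * (x - mean_2) ** 2 for x in signal[split:])
    return left + right

# Hypothetical observations, used only for illustration.
signal = [0.3, -1.2, 0.8, 0.1, -0.4, 1.5, -0.7, 0.2]
splits = range(1, len(signal))

# Equal state parameters: the change-point index has no effect on the information.
equal_means = [two_state_information(signal, s, 0.0, 0.0) for s in splits]

# Distinct state parameters: the information depends on the change-point index.
distinct_means = [two_state_information(signal, s, 0.0, 1.0) for s in splits]

print(max(equal_means) - min(equal_means))
print(max(distinct_means) - min(distinct_means))
```

The first printed spread is zero up to floating-point rounding, which is the discrete analogue of a vanishing Fisher information for the change-point index.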

Determination of model parameters. Fitting the change-point model is performed in two coupled steps. Given a
set of change-point indices $i^n = (i_1, \ldots, i_n)$, the maximum likelihood estimators (MLE) of the state model parameters
$\theta^n = (\theta_1, \ldots, \theta_n)$ are defined:

$$\hat{\theta}^n_X \equiv \arg\min_{\theta^n} h(X^N|M^n). \tag{6}$$

The determination of the change-point indices is a nontrivial problem since not only are the change-point indices
unknown, but even the number of transitions (n) is unknown.

Binary Segmentation Algorithm. To determine the change-point indices, we will use a binary-segmentation algorithm
that has been the subject of extensive study (e.g. see the references in [4]). In the global algorithm, we initialize
the algorithm with a single change point $i_1 = 1$. The data is sequentially divided into partitions by binary segmentation.
Every segmentation is greedy: i.e. we choose the change point on the interval (1, N) that minimizes the
information in that given step, without any guarantee that this is the optimum choice over multiple segmentations.
The family of models generated by successive rounds of segmentation are said to be nested since successive change
points are added without altering the time indices of existing change points. Therefore, the previous model is always
a special case of the new model. The binary segmentation process is shown schematically in Fig. 1, Panel B.
In each step, after the optimum index for segmentation is identified, we statistically test the change in information
(due to segmentation) to determine whether the new states are statistically supported. The change points determined
by binary segmentation define the change points in the MLE model $\hat{M}^n_X$. The local binary-segmentation algorithm
differs from the global algorithm only in that we consider the binary segmentation of each partition of the data
independently. The algorithms are described explicitly in the supplement.
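A single greedy segmentation step can be sketched as follows. This is a minimal illustration rather than the algorithm of the supplement: it assumes a unit-variance Gaussian state model with the mean as its only parameter, and the names `gaussian_information` and `best_binary_split`, as well as the simulated signal, are hypothetical.

```python
import random

def gaussian_information(segment):
    """Information (negative log-likelihood up to a constant) of a segment under a
    unit-variance Gaussian whose mean is set to its maximum likelihood value."""
    if not segment:
        return 0.0
    mean = sum(segment) / len(segment)
    return 0.5 * sum((x - mean) ** 2 for x in segment)

def best_binary_split(signal):
    """Return (index, delta_h) for the split that minimizes the information.

    `index` is the 0-based position of the first observation of the new state and
    `delta_h` (never positive) is the information change relative to the unsplit fit."""
    h_unsplit = gaussian_information(signal)
    best_index, best_delta = None, 0.0
    for i in range(1, len(signal)):
        delta = (gaussian_information(signal[:i])
                 + gaussian_information(signal[i:]) - h_unsplit)
        if best_index is None or delta < best_delta:
            best_index, best_delta = i, delta
    return best_index, best_delta

# Simulated two-state signal (assumed values): the mean shifts from 0 to 8 at index 50.
rng = random.Random(0)
signal = ([rng.gauss(0.0, 1.0) for _ in range(50)]
          + [rng.gauss(8.0, 1.0) for _ in range(50)])
index, delta_h = best_binary_split(signal)
print(index, delta_h)
```

The returned `delta_h` is the quantity that is subsequently tested against a penalty to decide whether the new state is statistically supported.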

Information-based model selection. The model that minimizes the cross entropy (Eqn. 4) is the most predictive
model. Unfortunately, the cross entropy cannot be computed since the expectation cannot be taken with respect
to the true but unknown probability distribution p in Eqn. 4. The natural estimator of the cross entropy is the
information (Eqn. 3), but this estimator is biased from below: Due to the phenomenon of over-fitting, added model
parameters always reduce the information (or equivalently the training error) even as the predictivity of the model is
reduced by the addition of superfluous parameters. We must therefore construct an unbiased estimator of the cross
entropy which we call the information criterion:

$$\mathrm{IC}(X^N, n) \equiv h(X^N|\hat{M}^n_X) + \mathcal{K}(n), \tag{7}$$

where $\mathcal{K}$ is the complexity of the model which is defined as the bias in the information as an estimator of cross
entropy:

$$\mathcal{K}(n) \equiv E_{X,Y}\left\{ h(Y^N|\hat{M}^n_X) - h(X^N|\hat{M}^n_X) \right\}, \tag{8}$$

where the expectations are taken with respect to the true distribution p and $X^N$ and $Y^N$ are independent signals.
For a regular model in the asymptotic limit, the complexity is equal to the number of model parameters and the
information criterion is equal to AIC. In the context of singular models, a more generally applicable approach must
be used to approximate the complexity.

Frequentist Information Criterion. The Frequentist Information Criterion (FIC) uses a more general approximation
to estimate the model complexity. Since the true distribution p is unknown, we make a frequentist approximation,
computing the complexity for the model M as a function of the true parameterization:

$$\mathcal{K}_{\rm FIC}(M^n, n) \equiv \mathop{E}_{X,Y \sim q(\cdot|M^n)}\left\{ h(Y^N|\hat{M}^n_X) - h(X^N|\hat{M}^n_X) \right\}, \tag{9}$$

and the corresponding information criterion is defined:

$$\mathrm{FIC}(X^N, n) \equiv h(X^N|\hat{M}^n_X) + \mathcal{K}_{\rm FIC}(\hat{M}^n_X, n), \tag{10}$$

where the complexity is evaluated at the MLE parameters $\hat{M}^n_X$. The model that minimizes FIC has the smallest
expected cross entropy.

Approximating the FIC complexity. The direct computation of the FIC complexity (Eqn. 9) appears daunting, but a
tractable approximation allows the complexity to be estimated. The complexity difference between the models is:

$$k(n) \equiv \mathcal{K}_{\rm FIC}(n) - \mathcal{K}_{\rm FIC}(n-1), \tag{11}$$

which is called the nesting complexity. An approximate piecewise expression can be computed as follows. Let the
observed change in the MLE information for the nth nesting be

$$\Delta h_n \equiv h(X^N|\hat{M}^n_X) - h(X^N|\hat{M}^{n-1}_X), \tag{12}$$

where n denotes the nth nesting of model M. Consider two limiting cases: When the new parameters are identifiable,
let the nesting complexity be given by $k_+$, whereas when the new parameters are unidentifiable, let the nesting
complexity be given by $k_-$. When the new parameters are identifiable, the model is essentially regular, therefore:

$$k_+ = d, \tag{13}$$

where d is the number of harmonic$^2$ parameters added to the model in the nesting procedure, as predicted by AIC.

To compute $k_-$, we assume the unnested model is the true model and compute the complexity difference in
Eqn. 11. We then apply a piecewise approximation for evaluating the nesting complexity [10]:

$$k(n) \approx \begin{cases} k_-(n), & -\Delta h_n < k_-(n) \\ k_+(n), & \text{otherwise.} \end{cases} \tag{14}$$

Since the nesting complexity represents complexity differences, the complexity can be summed:

$$\mathcal{K}_{\rm FIC}(n) = \sum_{j=1}^{n} k(j), \tag{15}$$

where the first term in the series, $k(1)$, is computed using the AIC expression for the complexity. An exact analytic
description of the complexity remains an open question.
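The piecewise rule and the sum over nestings can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `nesting_complexity` and `fic_complexity` are hypothetical, the first term of the series is taken to be d as for AIC, and the information changes and the values of the unidentifiable-limit penalty are invented numbers rather than results from the paper.

```python
def nesting_complexity(delta_h, k_minus, d):
    """Piecewise approximation: use k_- when the observed information gain is
    smaller than k_-, and the regular (AIC-like) value d otherwise."""
    return float(k_minus) if -delta_h < k_minus else float(d)

def fic_complexity(delta_hs, k_minus_values, d):
    """Cumulative complexity for n = 1, 2, ... states.

    delta_hs[j] and k_minus_values[j] belong to the nesting that adds state j + 2."""
    total = float(d)  # first term of the series: AIC complexity of a single state
    complexities = [total]
    for delta_h, k_minus in zip(delta_hs, k_minus_values):
        total += nesting_complexity(delta_h, k_minus, d)
        complexities.append(total)
    return complexities

# Hypothetical inputs: two well-supported nestings followed by a superfluous one.
example = fic_complexity(delta_hs=[-40.0, -25.0, -3.0],
                         k_minus_values=[8.0, 8.0, 8.0], d=1)
print(example)  # [1.0, 2.0, 3.0, 11.0]
```

In this example the complexity grows by d per state while the change points are identifiable and then changes slope once the information gain falls below the unidentifiable-limit penalty.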


III. AN INFORMATION CRITERION FOR CHANGE-POINT ANALYSIS


Complexity of a state model. As a first step towards computing the complexity for the change-point algorithm,
we will compute the complexity for a signal with only a single state. It will be useful to break the information into
the information per observation. Using the Markov property of the process, the information associated with the ith
observation is:

$$h_i(X^N|\theta) \equiv -\log q(X_i|X_{i-1}; \theta). \tag{16}$$

For a stationary process, the average information per observation is constant: $\bar{h} = E\, h_i$. The fluctuation in the
information $\delta h_i \equiv h_i - \bar{h}$ has the property that it is independent for each observation:

$$E\, \delta h_i\, \delta h_j = C_0\, \delta_{ij}, \tag{17}$$

where $C_0$ is a constant and $\delta_{ij}$ is the Kronecker delta, due to the Markovian property. In close analogy to the derivation
of AIC, we will Taylor expand the information in terms of the model parameterization $\theta$ around the true
parameterization $\theta_0$. We make the following standard definitions:

$$\delta\theta \equiv \theta - \theta_0, \tag{18}$$

$$\hat{I}_i \equiv \nabla_\theta \nabla_\theta^T h_i(X^N|\theta_0), \tag{19}$$

$$I \equiv E_X \nabla_\theta \nabla_\theta^T h_i(X^N|\theta_0), \tag{20}$$

$$X_i \equiv \nabla_\theta h_i(X^N|\theta_0), \tag{21}$$

$$X \equiv \sum_{i=1}^{N} X_i, \tag{22}$$
$^2$ Harmonic parameters are parameters with sufficiently large Fisher information that they are identifiable.
where $\delta\theta$ is the perturbation in the parameters, and $I$ and $\hat{I}_i$ are the Fisher Information and its estimator respectively.
The subscript i refers to the ith observation. Note that since the true parameterization minimizes the information by
definition, $E\, X_i = 0$. Furthermore, Eqn. 17 implies that

$$E_X\, X_i X_j^T = I\, \delta_{ij}, \tag{23}$$

where I is the Fisher Information. The Taylor expansion of the information can then be written:

$$h(X^N|\theta) = h(X^N|\theta_0) + \delta\theta^T X + \tfrac{1}{2}\, \delta\theta^T N I\, \delta\theta + \mathcal{O}(\delta\theta^3), \tag{24}$$

to quadratic order in $\delta\theta$.

It is convenient to transform the random variables $X_i$ to a new basis in which the Fisher Information is the identity.
This is accomplished by the transformation

$$X'_i \equiv I^{-1/2} X_i, \tag{25}$$

$$\theta' \equiv I^{1/2}\, \theta, \tag{26}$$

which results in the following expression for the information:

$$h(X^N|\theta) = h(X^N|\theta_0) + \delta\theta'^T X' + \tfrac{1}{2} N\, \delta\theta'^T \delta\theta' + \mathcal{O}(\delta\theta^3). \tag{27}$$

In our rescaled coordinate system, $X'$ can be interpreted as an unbiased random walk of N steps with unit variance
in each dimension.

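We determine the MLE parameter values:

$$\delta\hat{\theta}'_X = -\tfrac{1}{N} X'. \tag{28}$$

To compute the complexity we need the following expectations of the information:

$$E_{X,Y}\, h(Y^N|\hat{\theta}_X) = E_{X,Y}\left\{ h(Y^N|\theta_0) - \tfrac{1}{N} X'^T Y' + \tfrac{1}{2N} X'^2 + \mathcal{O}(\delta\theta^3) \right\}, \tag{29}$$

$$E_X\, h(X^N|\hat{\theta}_X) = E_{X,Y}\left\{ h(X^N|\theta_0) - \tfrac{1}{2N} X'^2 + \mathcal{O}(\delta\theta^3) \right\}. \tag{30}$$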

Since the signals $X^N$ and $Y^N$ are independent, the second term on the RHS of Eqn. 29 is exactly zero. It is straightforward
to demonstrate that

$$E_X\, X'^2 = N d, \tag{31}$$

where d is the dimension of the parameter $\theta$, which has an intuitive interpretation as the mean squared displacement
$\langle X'^2 \rangle$ of an unbiased random walk of N steps in d dimensions. The complexity is therefore:

$$\mathcal{K} = E_{X,Y}\left\{ h(Y^N|\hat{\theta}_X) - h(X^N|\hat{\theta}_X) \right\} = d, \tag{32}$$

which is the AIC complexity. To compute the complexity associated with the first binary segmentation, we will
compute the nesting complexity $k(2)$ using Eqn. 14. We will therefore generate the observations $X^N$ and $Y^N$ using
the unsegmented model $M^1$. Remember that by convention we assign the first change-point index to the first
observation: $i_1 = 1$. The optimal but fictitious change-point index for binary segmentation is:

$$\hat{i}_2(X) = \arg\min_{1 < i \le N} \left\{ h(X_{[1,i-1]}|\hat{\theta}_{X_{[1,i-1]}}) + h(X_{[i,N]}|\hat{\theta}_{X_{[i,N]}}) \right\}, \tag{33}$$

where the $X_{[i,j]}$ represent the respective partitions of the signal $X^N$ made by the change point i. (Note that in the
case of an AR process, it is possible to write overlapping partitions to account for the system memory.) The MLE
model for two states is defined:

$$\hat{M}^2_X \equiv \begin{pmatrix} 1 & \hat{i}_2 \\ \hat{\theta}_{X_{[1,\hat{i}_2-1]}} & \hat{\theta}_{X_{[\hat{i}_2,N]}} \end{pmatrix}. \tag{34}$$

To compute the nesting complexity, we compute the difference in the information between the two-state and one-state
MLE models:

$$h(X^N|\hat{M}^2_X) - h(X^N|\hat{M}^1_X) = \min_{1 < i \le N} \left\{ -\tfrac{1}{2(i-1)} X'^2_{[1,i-1]} - \tfrac{1}{2(N+1-i)} X'^2_{[i,N]} + \tfrac{1}{2N} X'^2_{[1,N]} \right\}, \tag{35}$$

where the terms that are zeroth order in the perturbation cancel since the model is nested and the $X'_{[i,j]}$ are the $X'$
computed in the two partitions of the data. (This equation is analogous to Eqn. 30.) It is straightforward to compute
the analogous expression for the information difference for signal $Y^N$. The nesting penalty can then be written:

$$k_-(2) = E_{X,Y}\left\{ \left[ h(Y^N|\hat{M}^2_X) - h(Y^N|\hat{M}^1_X) \right] - \left[ h(X^N|\hat{M}^2_X) - h(X^N|\hat{M}^1_X) \right] \right\} \tag{36}$$

$$\phantom{k_-(2)} = E_X \max_{1 < i \le N} \left\{ \tfrac{1}{i-1} X'^2_{[1,i-1]} + \tfrac{1}{N+1-i} X'^2_{[i,N]} - \tfrac{1}{N} X'^2_{[1,N]} \right\}, \tag{37}$$

where the cross terms between signals $X^N$ and $Y^N$ are zero since the signals are independent. It is now convenient
to introduce a d-dimensional discrete Brownian bridge:

$$B'_j \equiv X'_{[1,j]} - \tfrac{j}{N} X'_{[1,N]}, \tag{38}$$

by using the well known relation between Brownian walks and bridges [11]. The Brownian bridge has the property
that $B'_0 = B'_N = 0$, where each step has unit variance per dimension and mean zero. After some algebra, the nesting
complexity can be written:

$$k_-(2) = E_X \max_{1 \le j < N} \tfrac{N}{j(N-j)} B'^2_j. \tag{39}$$

The details of the state model will determine the distribution function for the discrete steps in the Brownian bridge,
but the Central Limit Theorem implies that the distribution will approach the normal distribution. Therefore, it is
convenient to approximate the discrete Brownian bridge $B'_j$ as an idealized Brownian bridge with normally distributed
steps:

$$B'_j \rightarrow B_j = \sum_{i=1}^{j} b_i, \quad \text{such that } B_N = 0, \tag{40}$$

where the $b_i$ are steps that are normally distributed with variance one per dimension d and mean zero. We now
introduce a new random variable $U(N, d)$, the d-dimensional Change-Point Statistic [12, 13]:

$$U(N, d) \equiv \tfrac{1}{2} \max_{1 \le j < N} \tfrac{N}{j(N-j)} B_j^2, \tag{41}$$

which is a d-dimensional generalization of the change-point statistic computed by Hawkins [14]. In terms of the
statistic U, the nesting penalty is

$$k_-(2) = 2\, E_U\, U(N, d) = 2\, \bar{U}(N, d). \tag{42}$$

We will discuss the connection to the frequentist likelihood-ratio test (LRT) shortly.
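The expectation of the change-point statistic can be estimated by Monte Carlo sampling of the idealized Brownian bridge. The following sketch is one possible implementation and not the authors' code: it assumes Gaussian steps, builds the bridge by subtracting the linear drift from a random walk, and uses hypothetical function names, sample counts and a fixed random seed.

```python
import random

def sample_change_point_statistic(n_obs, dim, rng):
    """Draw one realization of U(N, d) from a bridge with Gaussian steps."""
    steps = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_obs)]
    total = [sum(step[k] for step in steps) for k in range(dim)]
    walk = [0.0] * dim
    best = 0.0
    for j in range(1, n_obs):
        for k in range(dim):
            walk[k] += steps[j - 1][k]
        # Bridge B_j = S_j - (j / N) S_N, which vanishes at j = 0 and j = N.
        bridge_sq = sum((walk[k] - j * total[k] / n_obs) ** 2 for k in range(dim))
        best = max(best, n_obs / (j * (n_obs - j)) * bridge_sq)
    return 0.5 * best

def mean_change_point_statistic(n_obs, dim, n_samples=2000, seed=0):
    """Monte Carlo estimate of the expectation of U(N, d)."""
    rng = random.Random(seed)
    draws = (sample_change_point_statistic(n_obs, dim, rng) for _ in range(n_samples))
    return sum(draws) / n_samples

u_bar = mean_change_point_statistic(n_obs=100, dim=1)
k_minus = 2.0 * u_bar  # nesting penalty in the unidentifiable limit
print(k_minus)
```

Each term inside the maximum has expectation d, so the estimated penalty always exceeds the AIC value d; the number of samples controls the statistical error of the estimate.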

Nesting complexity for n states. The generalization of the analysis to n states is intuitive and straightforward. In
the local binary-segmentation algorithm, segmentation is tested locally. The relevant complexity is computed with
respect to the length of the Ith partition. It is convenient to work with the approximation that all partitions are of
equal length since the complexity is slowly varying in N. We therefore define the local nesting complexity

$$k_{L-}(n) = 2\, E_U\, U(N/n, d) = 2\, \bar{U}(N/n, d), \tag{43}$$

where $N/n$ is the mean partition length. The nesting complexity for the binary segmentation of a single state is shown
in Fig. 2 for several different dimensions d, and compared with the complexity predicted by AIC and BIC.

In the global binary-segmentation algorithm, the next change-point is chosen by identifying the best position over
all intervals. We therefore generalize all our expressions accordingly. We introduce a generalization of the Change-Point
Statistic where we replace N with a vector of the lengths of the constituent segments $N^n = (N_1, \ldots, N_n)$.
We now define our new change-point statistic:

$$U_G(N^n, d) \equiv \max_{1 \le i \le n} U(N_i, d). \tag{44}$$

Because it is computationally intensive to compute $U_G$ for all possible segmentations $N^n$, we assume that all the
partitions are roughly the same size and consider n segments of length $N/(n-1)$. Since the complexity is slowly
varying in N, this does not in general lead to significant information loss. We therefore introduce another change-point
statistic:

$$k_{G-}(n) \equiv 2\, E_U \max_{1 \le i \le n} U_i\!\left(\tfrac{N}{n-1}, d\right) \approx 2\, E_U\, U_G(N^n, d), \tag{45}$$

that we will apply in the global binary-segmentation algorithm.

FIG. 2: Nesting complexity for AIC, FIC and BIC. The nesting complexity is plotted for three state dimensions d = {1, 3, 6}
and n = 2. First note that the AIC penalty is much smaller than the other two nesting complexities. BIC is empirically known
to produce acceptable results under some circumstances. For sufficiently large samples (N), $k_{\rm BIC} > k_{\rm FIC}$, resulting in over-penalization
and the rejection of states that are supported statistically. This effect is more pronounced for large state dimension d,
where the crossover occurs for small observation number N. $k_{\rm BIC}$ is too small for a wide range of sample sizes, resulting in
over-segmentation.

Series expressions for the nesting complexity. It is straightforward to compute the asymptotic dependence of the
nesting penalty on the number of observations N:

$$k_{G-}(n) \approx 2 \log\log\tfrac{N}{n} + 2 \log n + d \log\log\log\tfrac{N}{n} + \ldots, \tag{46}$$

$$k_{L-}(n) \approx 2 \log\log\tfrac{N}{n} + d \log\log\log\tfrac{N}{n} + \ldots \tag{47}$$

These expressions converge slowly and, in practice, we advocate using Monte Carlo integration to determine
the nesting penalty. Even if they are cumbersome for computation, Eqns. 46 and 47 are useful in placing our approach
in relation to existing theory.
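The leading terms of the two series can be evaluated directly. The sketch below is an assumption-laden reading of Eqns. 46 and 47: it takes the argument of the iterated logarithms to be the mean partition length N/n, keeps only the terms written explicitly, and uses hypothetical function names. Because the series converge slowly, the values are indicative only and are not a substitute for Monte Carlo integration.

```python
import math

def k_local_asymptotic(n_obs, n_states, d):
    """Leading terms of the local nesting penalty (assumed reading of Eqn. 47)."""
    ratio = n_obs / n_states
    if ratio <= math.e:
        raise ValueError("the iterated logarithms require N / n > e")
    loglog = math.log(math.log(ratio))
    return 2.0 * loglog + d * math.log(loglog)

def k_global_asymptotic(n_obs, n_states, d):
    """Leading terms of the global nesting penalty (assumed reading of Eqn. 46)."""
    return k_local_asymptotic(n_obs, n_states, d) + 2.0 * math.log(n_states)

# The two penalties differ by the extra 2 log n contribution of the global search.
difference = k_global_asymptotic(1000, 4, 1) - k_local_asymptotic(1000, 4, 1)
print(difference, 2.0 * math.log(4))
```

The guard on N/n reflects the fact that the triple logarithm is undefined for short partitions, which is one more reason to prefer numerical estimates of the penalty for small samples.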

Both the local and the global encoding have the same leading-order 2 log log N dependence that has been advocated
by Hannan and Quinn [15], although interestingly not in this context. In contrast, this 2 log log N dependence
is in disagreement with the Bayesian Information Criterion, which has often been applied to change-point analysis.
As illustrated by Fig. 2, the BIC complexity:

$$\mathcal{K}_{\rm BIC} = \tfrac{d}{2} \log N, \tag{48}$$

can be either too large or too small depending on the number of observations and the dimension of the model. It has
long been appreciated that BIC can only be strictly justified in the large-observation-number limit. In this asymptotic
limit, the BIC complexity is always larger than the FIC complexity due to the leading order log N dependence which
will tend to lead to under-fitting or under-segmentation. It is clear from Fig. 2 that large (N > 10®) may constitute
much larger datasets than are produced in many applications.

Global versus local complexity. We proposed two possible parameter encoding algorithms above that give rise
to two distinct complexities: $k_{L-}$ and $k_{G-}$. Which complexity should be applied in the typical problem? For most
applications, we expect the number of states n to be proportional to the number of observations N. Doubling the
length of the dataset will result in the observation of twice as many change points on average. The application of the
local nesting complexity clearly has this desired property since it depends on the ratio N/n. It is this complexity
we advocate under most circumstances.

In contrast, the global nesting complexity contains an extra contribution to the complexity, 2 log n. The reason
is intuitive: In the global binary segmentation algorithm, one picks the best change point among n segments and
therefore the complexity must reflect this added degree of choice. Consequently a larger feature must be observed
to be above the expected background. The use of the global nesting complexity makes a statement of statistical
significance against the entire signal, not just against a local region. In the context of discussing the significance of
the observation of a rare state that occurs just once in a dataset, the global nesting complexity is the most natural
metric of significance.

Computing the complexity from the nesting complexity. To compute the FIC complexity, we sum the nesting
complexities using Eqn. 15. For datasets with identifiable change points, the FIC complexity is initially identical to
AIC:

$$\mathcal{K}_{\rm FIC}(n) = n d, \tag{49}$$

until the change in the information on nesting satisfies $-\Delta h_n < k_-(n)$, when FIC predicts that there is a change in slope in the
penalty. The FIC, AIC, and BIC predicted complexities are compared with the true complexity for an explicit change-point
analysis in Fig. 3, Panel C. It is immediately clear from this example that FIC quantitatively captures the true
dependence of the penalty, including the change in slope at n = 4, exactly as predicted by the FIC complexity. As
predicted, the AIC complexity is initially correct until the segmentation process must be terminated. At this point
the complexity increases significantly with the result that the AIC complexity fails to terminate the segmentation
process. In contrast, the BIC complexity is initially too large, but fails to grow at a sufficient pace to match the true
complexity for n > 4.


IV. THE RELATION BETWEEN FREQUENTIST AND INFORMATION-BASED APPROACHES

Consider the likelihood-ratio test (LRT) for the following problem: We propose the binary segmentation of a single partition.
The null hypothesis ($H_0$) is that the partition is described by a single state (unknown model parameters $\theta_0$) and the
hypothesis to be tested ($H_1$) is that the partition is actually subdivided into two states (unknown change point and
model parameters $\theta_1$ and $\theta_2$). We use the log-likelihood ratio as the test statistic:

$$V(X^N) \equiv \log \frac{q(X^N|\hat{M}^2_X)}{q(X^N|\hat{M}^1_X)} = h(X^N|\hat{M}^1_X) - h(X^N|\hat{M}^2_X). \tag{50}$$

In the Neyman-Pearson approach to hypothesis testing, we assume the null hypothesis (1 state) and compute the
distribution in the test statistic V. As before, we will expand the information around the true parameter values $\theta_0$.
In exact analogy to Eqn. 35, we find that V and our previously defined statistic U are identically distributed:

$$V \sim U, \tag{51}$$

up to the approximations discussed in the derivation. Therefore we will simply refer to V as U.

In the canonical frequentist approach we specify a critical test statistic value $u_\gamma$ above which the alternative
hypothesis is accepted. $u_\gamma$ is selected such that the alternative hypothesis $H_1$ is rejected, given that the null hypothesis
$H_0$ is true, with a probability equal to the confidence level $\gamma$:

$$\gamma = F_U(u_\gamma), \tag{52}$$

where $F_U$ is the cumulative distribution of U.

Therefore we can interpret both the information-based approach and the frequentist approach as making use of the
same statistic U. In the frequentist approach, a confidence level ($\gamma$) is specified to determine the critical value $u_\gamma$ with
which to accept the two-state hypothesis. The information-based approach also uses the statistic U, but the critical
value of the statistic ($k_-$) is computed from the distribution of the statistic itself: $k_- = 2\bar{U}$. The information-based
approach chooses the confidence level that optimizes predictivity.
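The confidence level implied by the information-based critical value can be estimated numerically. The sketch below is illustrative only: it assumes a one-dimensional state model (d = 1), Gaussian bridge steps, and hypothetical function names, sample counts and seed. It draws realizations of U under the null hypothesis, sets the critical value to twice the sample mean, and reports the fraction of realizations that fall below it.

```python
import random

def sample_u(n_obs, rng):
    """One realization of the change-point statistic U(N, 1) under the null."""
    steps = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    total = sum(steps)
    walk, best = 0.0, 0.0
    for j in range(1, n_obs):
        walk += steps[j - 1]
        bridge = walk - j * total / n_obs  # Brownian bridge, zero at both ends
        best = max(best, n_obs / (j * (n_obs - j)) * bridge ** 2)
    return 0.5 * best

def implied_confidence_level(n_obs, n_samples=2000, seed=1):
    """Return (critical value, empirical confidence level) for k_- = 2 * mean(U)."""
    rng = random.Random(seed)
    samples = [sample_u(n_obs, rng) for _ in range(n_samples)]
    critical_value = 2.0 * sum(samples) / n_samples
    gamma = sum(1 for u in samples if u < critical_value) / n_samples
    return critical_value, gamma

critical_value, gamma = implied_confidence_level(n_obs=100)
print(critical_value, gamma)
```

Since U is non-negative, Markov's inequality guarantees that at least half of the realizations lie below twice the mean, so the implied confidence level is always at least 0.5; its actual value depends on N and d.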


V. APPLICATIONS 

In the interest of brevity we have not included analysis of either experimental data or simulated data with a signal-model
dimension larger than one, but we have tested the approach extensively. For instance, we have applied this
technique to an experimental single-molecule biophysics application that is modeled by an Ornstein-Uhlenbeck process
with state-model dimension of four [16]. We also applied the approach in other biophysical contexts including
the analysis of bleaching curves, cell and molecular-motor motility [17].
FIG. 3: Information-based model selection. Panel A: Nested models generated by a Change-Point Algorithm. Simulated data
(blue points) generated by a true model with four states is fit to a family of nested models (red lines) using a Change-Point
Algorithm. Models fit with 1 ≤ n ≤ 8 states are plotted. The fit change points are represented as vertical black lines. The number
of states (n) in each fit model is shown in the top-left corner of each panel. The true model has four states and the fit model
with four states is indicated with a dotted box. The models with five through eight states have superfluous states that are not
present in the true model. Panel B: Four change points minimize information loss. Both the expectation of the information
(red) and the cross entropy (green) are plotted as a function of the number of states n. The y-axis (h, information) is split to show
the initial large changes in h as well as the subsequent smaller changes for 4 ≤ n ≤ 8. The cross entropy (green) is minimized
by the model that best approximates the truth (n = 4). The addition of superfluous parameters leads to an increase in cross entropy
(a less predictive model), as indicated by the increase of the cross entropy (green)
for n > 4. The information loss estimator (red) is biased and continues to decrease with the addition of states as a consequence
of over-fitting. In an experimental context only the information can be computed since the true distribution is unknown. Panel
C: Complexity of Change-Point Analysis. The true complexity is computed for the model shown in panel A via Monte Carlo
simulation for 10® realizations of the observations and compared with three models for the complexity: AIC, FIC and BIC. For
models with states numbering 1 ≤ n ≤ 4, the true complexity (black) is correctly estimated by the AIC complexity (red dotted)
and the FIC complexity (green). But for a larger number of states (4 < n ≤ 8), only FIC accurately estimates the true complexity.


VI. DISCUSSION 

In this paper, we present an information-based approach to change-point analysis using the Frequentist Information
Criterion (FIC). The information-based approach to inference provides a powerful framework in which models
with different parameterization, including different model dimension, can be compared to determine the most
predictive model. The model with the smallest information criterion has the best expected predictive performance
against a new dataset.

Our approach has two advantages over existing frequentist-based ratio tests for change-point analysis: (i) We derive
an FIC complexity that depends only on the dimension of the state model (d), the number of states (n) and
observations (N). Therefore it may be unnecessary to develop and compute custom statistics for specific applications.
(ii) In the frequentist approach one must specify an ad hoc confidence level to perform the analysis. In the
information-based approach, the confidence level is chosen automatically based upon the model complexity. The
information-based approach is therefore parameter and prior free.

As the number of change-points increases, the model complexity is observed to transition between an AIC-like
complexity $\mathcal{O}(N^0)$ and a Hannan-and-Quinn-like complexity $\mathcal{O}(\log\log N)$. We propose an approximate piecewise
expression for this transition. The computation of this approximate model complexity can be interpreted as the
expectation of the extremum of a d-dimensional Brownian bridge. We believe this information-based approach to
change-point analysis will be widely applicable.










































































Author Contributions 

P.A.W. and C.H.L. designed research; performed research; contributed analytic tools; analyzed data; or wrote the 
paper. 


Acknowledgments 

P.A.W. and C.H.L. would like to thank K. Burnham, J. Wellner, L. Weihs and M. Drton for advice and discussions, 
D. Dunlap and L. Finzi for experimental data and M. Linden and N. Kuwada for advice on the manuscript. This 
work was supported by NSF MCB grant 1243492. 


[1] M. A. Little and N. S. Jones, Proc Math Phys Eng Sci 467, 3088 (2011). 

[2] E. S. Page, Biometrika 42, 523 (1955). 

[3] E. S. Page, Biometrika 44, 248 (1957). 

[4] J. Chen and A. K. Gupta, Communications in Statistics-Simulation and Computation 30, 665 (2007). 

[5] M. A. Little and N. S. Jones, Proc Math Phys Eng Sci 467, 3115 (2011). 

[6] S. Kullback and R. Leibler, Annals of Mathematical Statistics 22, 79 (1951). 

[7] H. Akaike, in 2nd International Symposium on Information Theory, edited by B. N. Petrov and F. Csaki (Akademiai Kiado, Budapest, 1973), pp. 267-281. 

[8] K. P. Burnham and D. R. Anderson, Model selection and multimodel inference (Springer-Verlag New York, Inc., 1998), 2nd ed. 

[9] S. Watanabe, Algebraic geometry and statistical learning theory (Cambridge University Press, 2009). 

[10] P. A. Wiggins, In preparation. (2015). 

[11] Wikipedia, Brownian bridge — wikipedia, the free encyclopedia (2015), [Online; accessed 19-May-2015]. 

[12] L. Horváth, The Annals of Statistics 21, 671 (1993). 

[13] L. Horváth, P. Kokoszka, and J. Steinebach, Journal of Multivariate Analysis 68, 96 (1999). 

[14] D. M. Hawkins, Journal of the American Statistical Association 72, 180 (1977). 

[15] E. Hannan and B. G. Quinn, Journal of the Royal Statistical Society, Series B 41, 190 (1979). 

[16] P. A. Wiggins, Submitted to Biophys J. (2015). 

[17] P. A. Wiggins, In preparation (2015). 

[18] A. Khinchine, Fundamenta Mathematicae 6, 9 (1924). 

[19] A. Kolmogoroff, Mathematische Annalen 101, 126 (1929). 

[20] Wikipedia, Law of the iterated logarithm — wikipedia, the free encyclopedia (2015), [Online; accessed 19-May-2015]. 

[21] D. A. Darling and P. Erdős, Duke Math J. 23, 143 (1956). 

[22] Wikipedia, Gumbel distribution — wikipedia, the free encyclopedia (2015), [Online; accessed 19-May-2015]. 





Global Binary-Segmentation Algorithm 

1. Initialize the change-point vector: $\mathbf{i} \leftarrow \{1\}$ 

2. Segment model $\mathcal{M}(\mathbf{i})$: 

(a) Compute the entropy change that results from all possible new change-point indices $j$: 

$$\Delta h_j \equiv h(\{i_1, \ldots, j, \ldots\}|X) - h(\mathbf{i}|X) \tag{53}$$

(b) Find the minimum information change $\Delta h_{\min}$ and the corresponding index $j_{\min}$. 

(c) If the information change plus the nesting complexity is less than zero: 

$$\Delta h_{\min} + k_G^- < 0 \tag{54}$$

then accept the change-point $j_{\min}$: 

i. Add the new change-point to the change-point vector: 

$$\mathbf{i} \leftarrow \{i_1, \ldots, j_{\min}, \ldots, i_{n+1}\} \tag{55}$$

ii. Segment model $\mathcal{M}(\mathbf{i})$. 

(d) Else terminate the segmentation process. 

TABLE I: A global algorithm for binary segmentation. The information $h$ is implicitly evaluated at the MLE state-model parameters $\hat{\theta}$. 
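The steps of Table I can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the state model is assumed to be Gaussian with unknown mean and known unit variance, the information is taken to be the negative log-likelihood at the MLE mean, and `nesting_complexity` is a hypothetical user-supplied callable standing in for the nesting complexity of Eqn. 54.

```python
import numpy as np

def segment_information(x):
    """Negative log-likelihood of one segment at the MLE mean.

    Assumption: Gaussian state model, unknown mean, known unit variance.
    """
    n = len(x)
    return 0.5 * np.sum((x - np.mean(x)) ** 2) + 0.5 * n * np.log(2 * np.pi)

def global_binary_segmentation(x, nesting_complexity):
    """Greedy global segmentation following the steps of Table I.

    `nesting_complexity(n_states, N)` is a hypothetical callable supplied
    by the user. Change points are returned as 0-based start indices.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    cps = [0]  # step 1: one state starting at the first observation
    while True:
        bounds = cps + [N]
        best_dh, best_j = np.inf, None
        # step 2(a): information change for every candidate index j
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            h_old = segment_information(x[lo:hi])
            for j in range(lo + 1, hi):
                dh = (segment_information(x[lo:j])
                      + segment_information(x[j:hi]) - h_old)
                if dh < best_dh:  # step 2(b): keep the minimum
                    best_dh, best_j = dh, j
        # steps 2(c) and 2(d): accept only if change + complexity < 0
        if best_j is None or best_dh + nesting_complexity(len(cps), N) >= 0:
            return cps
        cps = sorted(cps + [best_j])
```

For example, `global_binary_segmentation(x, lambda n, N: 10.0)` uses a constant penalty purely for illustration; in practice the penalty would be the nesting complexity derived in the main text.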


1. Type I errors (false positives) 

In terms of the cumulative distribution function (CDF), the probability of a false positive change-point is: 

$$\alpha = 1 - F_U(2\overline{U}), \tag{59}$$

where $U$ is the relevant change-point statistic and $\overline{U}$ is its expectation. Using the local binary-segmentation algorithm, $\alpha$ corresponds to the probability of a false positive per data partition and the change-point statistic is defined by Eqn. 41 evaluated at the average partition length $N_p = N/n$. The false positive change-point acceptance probability is plotted in Figure 4. 

The analogous false positive rate for the global binary-segmentation algorithm describes the probability of a false positive in the entire data set, including all partitions. In this case, we use the change-point statistic defined by Eqn. ??. 


2. Asymptotic form of the complexity function 

In order to discuss the scaling of the complexity relative to the BIC complexity, we need to derive an asymptotic form for the complexity in the large $N$ limit. We do not recommend explicitly using this asymptotic expression for the complexity for Change-Point Analysis since it converges to the true complexity very slowly, especially for large $d$. 

First let us consider related results for a Brownian walk rather than a Brownian bridge. Let us define $S_n$ as follows: 

$$S_n \equiv |\mathbf{Z}_n|, \tag{60}$$

$$\mathbf{Z}_n \equiv \sum_{i=1}^{n} \mathbf{z}_i, \tag{61}$$

where the $\mathbf{z}_i$ are independent normally-distributed random variables with mean zero and variance one per dimension 
Local Binary-Segmentation Algorithm 

1. Initialize the change-point vector: $\mathbf{i} \leftarrow \{1\}$, $I \leftarrow 1$. 

2. Segment model $\mathcal{M}(\mathbf{i})$ on state $I$: 

(a) Compute the entropy change that results from all possible new change-point indices $j$ on the interval $[i_I, i_{I+1})$: 

$$\Delta h_j \equiv h(\{\ldots, i_I, j, i_{I+1}, \ldots\}|X) - h(\mathbf{i}|X) \tag{56}$$

(b) Find the minimum information change $\Delta h_{\min}$ and the corresponding index $j_{\min}$. 

(c) If the information change plus the nesting complexity is less than zero: 

$$\Delta h_{\min} + k_L^- < 0 \tag{57}$$

then accept the change-point $j_{\min}$: 

i. Add the new change-point to the change-point vector: 

$$\mathbf{i} \leftarrow \{\ldots, i_I, j_{\min}, i_{I+1}, \ldots\} \tag{58}$$

ii. Segment model $\mathcal{M}(\mathbf{i})$ on states $I$ and $I + 1$. 

iii. Merge the resulting index lists. 

(d) Else terminate the segmentation process. 

TABLE II: A local algorithm for binary segmentation. The information $h$ is implicitly evaluated at the MLE state-model parameters $\hat{\theta}$. 
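The local algorithm of Table II lends itself to a recursive sketch in Python. As an illustration only, the code below assumes a Gaussian state model with unknown mean and known unit variance, and it treats `nesting_complexity` as a hypothetical callable that takes the length of the interval being segmented; neither choice is prescribed by the table.

```python
import numpy as np

def segment_information(x):
    """Negative log-likelihood of one segment at the MLE mean.

    Assumption: Gaussian state model, unknown mean, known unit variance.
    """
    n = len(x)
    return 0.5 * np.sum((x - np.mean(x)) ** 2) + 0.5 * n * np.log(2 * np.pi)

def local_binary_segmentation(x, nesting_complexity, lo=0, hi=None):
    """Recursive local segmentation of x[lo:hi] following Table II.

    `nesting_complexity(interval_length)` is a hypothetical callable.
    Returns the sorted 0-based start indices of the inferred states.
    """
    x = np.asarray(x, dtype=float)
    if hi is None:
        hi = len(x)
    candidates = range(lo + 1, hi)
    if len(candidates) == 0:
        return [lo]
    # step 2(a): information change for each candidate on [lo, hi)
    h_old = segment_information(x[lo:hi])
    dhs = [segment_information(x[lo:j]) + segment_information(x[j:hi]) - h_old
           for j in candidates]
    # step 2(b): minimum information change and its index
    k = int(np.argmin(dhs))
    j_min, dh_min = candidates[k], dhs[k]
    # step 2(d): terminate on this interval if the test fails
    if dh_min + nesting_complexity(hi - lo) >= 0:
        return [lo]
    # step 2(c): accept j_min, segment both new states, merge the lists
    left = local_binary_segmentation(x, nesting_complexity, lo, j_min)
    right = local_binary_segmentation(x, nesting_complexity, j_min, hi)
    return left + right
```

Because each accepted change point splits the problem into two independent intervals, the recursion only ever tests candidates inside the state it is currently segmenting, which is the defining difference from the global algorithm.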



FIG. 4: Probability of a false positive change-point. The probability of a false positive change-point is shown as a function of the interval length $N$ (the number of observations in the interval) for three different model dimensions. 


$d$. The Law of the Iterated Logarithm states that [18-20]: 

$$\limsup_{n \to \infty} \frac{S_n}{\sqrt{n \log\log n}} = \sqrt{2} \quad \text{a.s.}, \tag{62}$$

where a.s. is the abbreviation for almost surely. (See Figure 5.) This behavior of $S_n$ is described in more detail by the Darling-Erdős Theorem [21]. Let us define a new random variable 

$$U'(N_p) \equiv \max_{1 \le n \le N_p} \frac{S_n}{\sqrt{n}}. \tag{63}$$
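The quantities in Eqns. 62 and 63 can be explored numerically. The following Python sketch is illustrative only; it assumes that $S_n$ is the Euclidean norm of the cumulative sum of unit-variance Gaussian steps, and the function name and arguments are hypothetical.

```python
import numpy as np

def walk_statistics(n_steps, d=1, seed=0):
    """Simulate one d-dimensional Gaussian random walk.

    Returns U' (the maximum of S_n / sqrt(n), Eqn. 63) and the ratio
    S_n / sqrt(n log log n) that appears in Eqn. 62.
    """
    rng = np.random.default_rng(seed)
    steps = rng.standard_normal((n_steps, d))
    # Assumption: S_n is the norm of the cumulative sum of the steps
    S = np.linalg.norm(np.cumsum(steps, axis=0), axis=1)
    n = np.arange(1, n_steps + 1)
    u_prime = np.max(S / np.sqrt(n))
    # log log n is positive only for n >= 3, so restrict the ratio
    mask = n >= 3
    lil_ratio = S[mask] / np.sqrt(n[mask] * np.log(np.log(n[mask])))
    return u_prime, lil_ratio
```

Plotting `lil_ratio` for a long walk gives a qualitative picture of how slowly the supremum approaches its limit, which is one reason the asymptotic expressions in this section should be used with care.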
In $d = 1$ dimension, the asymptotic cumulative distribution of $U'$ approaches the cumulative distribution for a Gumbel Distribution [21]: 

$$\lim_{N_p \to \infty} \Pr[U' < \beta t + u] = \exp[-e^{-t}], \tag{64}$$

$$u(N_p) = \beta\left[\beta^{-2} + \tfrac{1}{2}\log\log\log N_p - \log 2\pi^{1/2}\right], \tag{65}$$

$$\beta(N_p) = (2\log\log N_p)^{-1/2}, \tag{66}$$


where Pr denotes probability, the distribution parameters $u$ and $\beta$ are called the location and scale respectively, and the average partition length is $N_p = N/n$. Let us introduce the cumulative distribution function for $U$: 

$$F_U(U) \equiv \Pr[U' < \sqrt{U}]. \tag{67}$$

This expression can be reordered to put it in the canonical form of the Gumbel Distribution [22]: 

$$F_U(U) = \exp\left[-\exp\left(-\frac{\sqrt{U} - u}{\beta}\right)\right]. \tag{68}$$


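We can then use the well-known expression for the CDF of the maximum of $n$ independent random variables, written in terms of the CDF of a single variable, to compute the CDF of the maximum of $n$ random variables $U'$: 

$$\Pr[U'_{(n)} < \sqrt{U}] = F_U^n(U), \tag{69}$$

$$= \left(\exp\left[-\exp\left(-\frac{\sqrt{U} - u}{\beta}\right)\right]\right)^n, \tag{70}$$

$$= \exp\left[-\exp\left(-\frac{\sqrt{U} - u_n}{\beta}\right)\right], \tag{71}$$

where 

$$u_n \equiv u + \beta\log n. \tag{72}$$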
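The step from Eqn. 70 to Eqn. 71 relies on the fact that raising a Gumbel CDF to the $n$th power only shifts its location by $\beta \log n$. A short Python check is sketched below. It assumes the Darling-Erdős forms $\beta = (2\log\log N_p)^{-1/2}$ and $u = \beta[\beta^{-2} + \tfrac{1}{2}\log\log\log N_p - \log 2\pi^{1/2}]$, and all function names are hypothetical.

```python
import math

def gumbel_cdf(x, loc, scale):
    """CDF of a Gumbel distribution with the given location and scale."""
    return math.exp(-math.exp(-(x - loc) / scale))

def scale_beta(Np):
    """Scale parameter; assumed form (2 log log Np)^(-1/2), valid for Np > e."""
    return (2.0 * math.log(math.log(Np))) ** -0.5

def location_u(Np):
    """Location parameter in d = 1 (assumed Darling-Erdos form)."""
    b = scale_beta(Np)
    return b * (b ** -2
                + 0.5 * math.log(math.log(math.log(Np)))
                - math.log(2.0 * math.sqrt(math.pi)))

def location_max_of_n(Np, n):
    """Shifted location u_n = u + beta * log(n) for the maximum of n draws."""
    return location_u(Np) + scale_beta(Np) * math.log(n)

# Example: the n-th power of the CDF equals the CDF with shifted location
Np, n, x = 1000.0, 5, 3.0
lhs = gumbel_cdf(x, location_u(Np), scale_beta(Np)) ** n
rhs = gumbel_cdf(x, location_max_of_n(Np, n), scale_beta(Np))
```

Here `x` plays the role of $\sqrt{U}$; `lhs` and `rhs` agree to floating-point precision for any `x`, since the identity is exact.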

The mean and variance of the Gumbel Distribution are well known, allowing us to compute the expectation of $U$: 

$$\overline{U}(N_p, n) \approx \left(u_n + \cancel{\gamma\beta}\right)^2 + \cancel{\tfrac{\pi^2}{6}\beta^2}, \tag{73}$$

$$\approx 2\log\log N_p + 2\log n + \ldots \tag{74}$$

where $\gamma$ is the Euler-Mascheroni constant and we have used the cancel notation to show which terms have been dropped to lowest order. In the second line, we have written the expression to lowest order in $N$ and $n$. 

Horváth has generalized the Darling-Erdős Theorem for a Brownian bridge in $d$ dimensions for the application to Change-Point Analysis in the context of the likelihood ratio test (LRT) [12, 13]. The generalized expression for the cumulative distribution leads to a change in the expression for $u$ only: 

$$u_d(N_p) = \beta\left[\beta^{-2} + \tfrac{d}{2}\log\log\log N_p - \log\Gamma\left(\tfrac{d}{2}\right)\right], \tag{75}$$

where $\Gamma$ is the Gamma Function. We drop the last term since it is not leading order for large $N_p$. We now follow the same steps to generate the distribution for the maximum of $n$ random variables $U'$, leading to a new Gumbel Distribution with location $u_{n,d}$: 


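$$u_{n,d}(N_p) = \beta\left[\beta^{-2} + \tfrac{d}{2}\log\log\log N_p + \log n\right]. \tag{76}$$

We now recompute the expectation for $d$ dimensions: 

$$\overline{U}(N_p, n, d) = \mathbb{E}_X\, U'^{2}_{(n)}(N_p, n, d), \tag{77}$$

$$\approx \left(u_{n,d} + \cancel{\gamma\beta}\right)^2 + \cancel{\tfrac{\pi^2}{6}\beta^2}, \tag{78}$$

$$\approx 2\log\log N_p + 2\log n + d\log\log\log N_p + \ldots \tag{79}$$

where we have kept terms only to highest order in $n$ and $N_p$. 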
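To get a feel for how much is lost in the leading-order expression, the two approximations can be compared numerically. The Python sketch below is an illustration under assumptions: it takes the location to be $u_{n,d} = \beta[\beta^{-2} + \tfrac{d}{2}\log\log\log N_p + \log n]$ with $\beta = (2\log\log N_p)^{-1/2}$, uses the standard Gumbel mean and variance, and the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def gumbel_parameters(Np, n, d):
    """Scale and location of the assumed Gumbel form (requires Np > e)."""
    loglog = math.log(math.log(Np))
    beta = (2.0 * loglog) ** -0.5
    loc = beta * (beta ** -2 + 0.5 * d * math.log(loglog) + math.log(n))
    return beta, loc

def gumbel_expectation(Np, n, d):
    """Expectation of the squared Gumbel variable: mean^2 + variance."""
    beta, loc = gumbel_parameters(Np, n, d)
    mean = loc + beta * EULER_GAMMA
    variance = (math.pi ** 2 / 6.0) * beta ** 2
    return mean ** 2 + variance

def leading_order_expectation(Np, n, d):
    """Leading-order form: 2 log log Np + 2 log n + d log log log Np."""
    loglog = math.log(math.log(Np))
    return 2.0 * loglog + 2.0 * math.log(n) + d * math.log(loglog)

# Example comparison at moderate partition length
full = gumbel_expectation(1000, 10, 1)
leading = leading_order_expectation(1000, 10, 1)
```

At moderate $N_p$ the dropped terms are far from negligible, so `full` exceeds `leading` by a sizeable margin. This is in line with the caution above that the asymptotic form converges slowly.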
FIG. 5: Panel A: Brownian walk and Brownian bridge. A visualization of a random walk $X_{[1,n]}$ (blue) and the corresponding Brownian bridge $B_n$ (red). Panel B: Law of the Iterated Logarithm. A visualization of $S_n/\sqrt{n \log\log n}$ (blue) plotted as an orthographic projection as a function of $n$. $\sqrt{2}$ (red) is the limit of the supremum. 




















